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MONTE CARLO LIKELIHOOD INFERENCE FOR MISSING 

DATA MODELS 
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We describe a Monte Carlo method to approximate the maxi- 
mum hkehhood estimate (MLE), when there are missing data and 
the observed data likehhood is not available in closed form. This 
method uses simulated missing data that are independent and iden- 
tically distributed and independent of the observed data. Our Monte 
Carlo approximation to the MLE is a consistent and asymptotically 
normal estimate of the minimizer 9* of the KuUback-Leibler informa- 
tion, as both Monte Carlo and observed data sample sizes go to infin- 
ity simultaneously. Plug-in estimates of the asymptotic variance are 
provided for constructing confidence regions for 6*. We give Logit- 
Normal generalized linear mixed model examples, calculated using 
an R package. 

1. Introduction. Missing data [20] either arise naturally — data that might 
have been observed are missing — or are intentionally chosen — a model in- 
cludes random variables that are not observable (called latent variables or 
random effects). A normal mixture model or a generalized linear mixed 
model (GLMM) is an example of the latter. In either case, a model is spec- 
ified for the complete data {x,y), where x is missing and y is observed, by 
their joint density fg{x,y), also called the complete data likelihood (when 
considered as a function of 9). The maximum likelihood estimator (MLE) 
maximizes the marginal density fe{y), also called the observed data likeli- 
hood (when considered as a function of 9). This marginal density is only 
implicitly specified by the complete data model, fe{y) = J feix,y)dx, and is 
often not available in closed form. This is what makes likelihood inference 
for missing data difficult. 

Many Monte Carlo methods for approximating the observed data likeli- 
hood in a missing data model have been proposed. In these, missing data are 
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simulated by either ordinary Monte Carlo [17, 24] or Markov chain Monte 
Carlo (MCMC) [10, 12, 18, 28, 29]. To get a good approximation of the 
likelihood over a large region, umbrella sampling [13, 30] may be necessary. 
There are also many Monte Carlo methods for maximum likelihood with- 
out approximating the likelihood: stochastic approximation [23, 38], Monte 
Carlo EM [14, 34], and Monte Carlo Newton-Raphson [25]. There are also 
non-Monte Carlo methods for maximum likelihood without approximating 
the likelihood: EM [8] and analytic approximation [6]. There are so many 
methods because each has its strength and weakness. In theory, Monte Carlo 
methods work for complicated problems but require very careful calibration, 
whereas non-Monte Carlo methods are relatively easier to implement but 
only apply to simple cases. All are useful for some, but not all, problems. 

This article is concerned with a Monte Carlo approximation of the ob- 
served data likelihood and asymptotic properties of the maximizer of our 
Monte Carlo likelihood. Our method uses simulated missing data that are 
independent and identically distributed (i.i.d.) and independent of the ob- 
served data. It approximates the likelihood over the entire parameter space. 
We give Logit-Normal GLMM examples to illustrate the case when our 
asymptotic normality holds and to show the value of our Monte Carlo like- 
lihood approximation even when asymptotic normality does not hold. 

Let the observed data Yi, . . . ,1^ be i.i.d. from a density g, which is not 
assumed to be some fg. We do not assume the model is correctly specified, 
since this increase of generality makes the theory no more difficult. The MLE 
9n is a maximizer of the log likelihood 



n 



(1) U9) = ^logfeiY,). 

i=i 

In our method, we generate an i.i.d. Monte Carlo sample Xi, . . . ,Xm, inde- 
pendent of li, ... ,1^, from an importance sampling density h and approxi- 
mate fg{y) by 



feiX^,y) 



^ rn 

This makes heuristic sense because 

by the strong law of large numbers. (The subscript m on the arrow means 
as m goes to infinity. Similarly, a subscript m, n means as both m and n go 
to infinity.) Our estimate of On is the maximizer ^ of our Monte Carlo 
log likelihood 

n 

(3) lm,n{G)=Y,^Ogfe,miYj), 
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an approximation to /ri(^) with /e^^ replacing fg. We call 9m,n the Monte 
Carlo MLE (MCMLE). 

Under the conditions of Theorem 2.3, the MCMLE 

4 ^m.n-AA r, + 

\ n m 

for sufficiently large Monte Carlo sample size m and observed data sample 
size n, where 6* is the minimizer of the Kullback-Leibler information 

(5) K{e)=E,logj^ 

J is minus the expectation of the second derivative of the log likelihood, V is 
the variance of the first derivative of the log likelihood (score) , and W is the 
variance of the deviation of the score from its Monte Carlo approximation 
[given by (7) below]. Under certain regularity conditions [15, 35], 

(6) e^^N(e*,:L_lJ 

\ n 

We see that ^m,n has nearly the same distribution when the Monte Carlo 
sample size m is very large. If the model is correctly specified, that is, g = fe^, 
then 9* = 6q and J = V, either of which is called Fisher information, and (6) 
becomes 

e^^Mle*,— 

\ n 

the familiar formula due to Fisher and Cramer. This replacement of J^^ 
by the so-called "sandwich" J~'^V is the only complication arising from 
model misspecification [19]. 

The first term of the asymptotic variance in (4) is what would be the 
asymptotic variance if we could use the exact likelihood rather than Monte 
Carlo. Hence it is the same as the asymptotic variance in (6). The second 
term is additional variance due to Monte Carlo. Increasing the Monte Carlo 
sample size m can make the second term as small we please so that the 
MCMLE &m.,n is almost as good as the MLE On- In (4), W is the only term 
related to the importance sampling density h that generates the Monte Carlo 
sample. Choosing an h that makes W smaller makes Ora,n more accurate. 

The asymptotic distribution of O^.n in (4) is a convolution of two inde- 
pendent normal distributions. The proof of this is not simple, however, for 
three reasons. First, the finite sample terms from which these arise [the two 
terms on the right-hand side in (9) below] are dependent. Second, one of 
these is itself a sum of dependent terms, because each term in (3) uses the 
same X's. Third, our two sample sizes m and n go to infinity simultaneously, 
and we must show that the result does not depend on the way in which m 
and n go to infinity. 
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2. Asymptotics of 9m,n- In this section, we state theorems about strong 
consistency and asymptotic normahty of the MCMLE O^.n- Proofs are in 
the Appendix. 

We use empirical process notation throughout. We let P denote the prob- 
ability measure induced by the importance sampling density /i, and we let 
Prra denote the empirical measure induced by Xi, . . . , Xm (that are i.i.d. from 
P). Similarly, we let Q denote the probability measure induced by the true 
density g and denote the empirical measure induced by Yi , . . . , y„ (that 
are i.i.d. from Q). Given a measurable function / : ^ i— > R, we write Pm/(-'^) 
for the expectation of / under and Pf[X) for the expectation under P. 
Similarly we use Q„/(y) and Qf{Y). Note that Pm/(X) = ^YZi fi^i) is 
just another notation for a particular sample mean. 

The Kullback-Leibler information in (5) is written as K{9) = Q\og[g{Y)/ 
fg(Y)], its empirical version as Kn{9) = '^n^og[g{Y) / fgiY)] and our approx- 
imation to Kn{9) as 

Km,n{0)=Qnlog[g{Y)/fe,^{Y)] 
with fe,m{y)=^mfe{X,y)/h{X). Then Kn{9) = Qnlog g{Y) - ln{9)/n and 
Km,n{Q) = ^n^ogg{Y) — lm,n{^)/n. Hence the MLE 9n is the minimizer of 
Kn and the MCMLE 

Gm,n is the minimizer of By Jensen's inequality 

K{9) > 0. This allows K{9) = oo for some 9, but we assume K{9*) is finite. 
[This excludes only the uninteresting case of the function 9 i-^ K{9) being 
identically oo.] 

2.1. Epi-convergence of Km,n- To get the convergence of 9m,n to 9* we 
use epi-convergence of the function Km^n to the function K. Epi-convergence 
is a "one-sided" uniform convergence that was first introduced by Wijsman 
[36, 37], developed in optimization theory [2, 3, 26] and used in statistics 
[11, 12]. It is weaker than uniform convergence yet insures the convergence 
of minimizers as the following proposition due to Attouch [2], Theorem 1.10, 
describes. 

Proposition 2.1. Let X he a general topological space, {/n} a sequence 
of functions from X to M that epi-converges to f , and a sequence 

of points in X satisfying fn{xn) < inf fn + En with e„ [ 0. Then for every 
converging subsequence Xn^. — > xq 

f{xo)= inf / = lim/„^,(3;„J. 

If / has a unique minimizer x, then x is the only cluster point of the 
sequence Otherwise, there may be many cluster points, but all must 

minimize /. There may not be any convergent subsequence. If the sequence 
{xn} is in a compact set and X is sequentially compact, however, there is 
always a convergent subsequence. 
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Theorem 2.2. Let {fg{x,y) -.O £ &}, where QcW'-, be a family of den- 
sities with respect to a a -finite measure fix u on X x y, let Xi, X2, . . . be 
i.i.d. from a probability distribution P that has a density h with respect to 
fi, and let Yi, Y2, . . . be i.i.d. from a probability distribution Q that has a 
density g with respect to v . Suppose: 

(1) Q is a second countable topological space; 

(2) for each {x,y), the function 6 fg(^x,y) is upper semicontinuous on 

Q; 

(3) for each 9, there exists a neighborhood Bg of 6 such that 



Qlog 



Psup/4x,y)//i(x)5(r) 



< co: 



(4) for each 6, there exists a neighborhood Cg of 9 such that for any 
subset B ofCg, the family of functions {svip^^B f<i>{-^y)/K-)9{y)'-V is 
P- Glivenko-Cantelli; 

(5) foreachO, the family of functions {fg(-\y)/h{-) ly £ y} is P-Glivenko- 
Cantelli. 

Then n epi-converges to K with probability one. 



Glivenko-Cantelli means a family of functions for which the uniform 
strong law of large numbers holds ([32], page 81). Conditions (1) through 
(3) are similar to those of Theorem 2 in [12]. Also they are vaguely similar 
to those in [33], which imply epi-convergence of Kn to K (when there are 
no missing data and no Monte Carlo). 

2.2. Asymptotic normality of 9m,n- The following theorem assumes that 
the local minimizer 9* of K is an interior point of Q and that K is differ- 
entiable. Hence VK{9*) =0, where V means differentiation with respect to 
9. 



Theorem 2.3. Let {fg{x,y) :6' G 6}, where B C , be a family of den- 
sities with respect to a a -finite measure ^x u on X x y, let Xi, X2, . . . be 
i.i.d. from a probability distribution P that has a density h with respect to 
/i, and let Yi, Y2,... be i.i.d. from a probability distribution Q that has a 
density g with respect to v. Suppose: 

(1) second partial derivatives of fg{x,y) with respect to 9 exist and are 
continuous on O for all x and y, and may be passed under the integral sign 
in J fg{x\y)di2{x); 

(2) y is a separable metric space and y V fg* {x\y) is continuous for 
each x; 
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(3) there is an interior point 9* of such that QV log fe* (Y) = 0, V = 
vaiQ V log /e* (y ) is finite and J = —QV'^ log fg*(Y) is finite and nonsingu- 
lar; 

(4) there exists a p> such that Sp = {6 -.{O — 6*\ < p} is contained in 
and T\ = {V^/e(-) : 6 G Sp} is Q-Glivenko-Cantelli; 

(5) T2 = {fe*{-\y)/h{-):y e y} is P-Glivenko-Cantelli; 

(6) ^3 = {V/e* :y € 3^} is P-Donsker and its envelope function 
F has a finite second moment; 

(7) = {V^fe{-\y)/h{-):ye y,9e Sp} is P-Glivenko-Cantelli; 

(8) there is a sequence 6m,n which converges to 9* in probability such that 

Then 

(7) W = yavpQVfg,{X\Y)/h{X) 
is finite and 

(8) - + — - ^m,n M{^, I). 

\n m / 

Donsker means a family of functions for which the uniform central limit 
theorem holds ([32], page 81). Note J^s is a family of vector-valued functions 
and J-'i and J-'^ are families of matrix-valued functions. Such families are 
Glivenko-Cantelli or Donsker if each component is ([31], page 270). Con- 
ditions (1), (3), (4) and (8) are similar to the usual regularity conditions 
for asymptotic normality of the MLE, which can be found, for example, in 
[9], Chapter 18. For a correctly specified model, differentiability under the 
integral sign in 1 = // fQ{x,y) dfi{x) diy{y) implies conditions (1) and (3). 
Condition (4) holds if functions in J^i are dominated by a L^{Q) function 
because Sp is compact ([9], Theorem 16(a)). 

Under smoothness conditions imposed in this theorem, the asymptotics 
of ^m,,n arises from the asymptotics of 

(9) VKra,nm = -Q^V log fe* (Y) - logFmfe* {X\Y) / h{X) . 

The two terms on the right-hand side are dependent and the summands 
in the second term are dependent, which indicates the complexity of this 
problem and why the usual asymptotic arguments do not work here. The 
asymptotics for the first term follow from the central limit theorem. The 
asymptotics for the second term (Lemma A. 4) go as shown below: 

m 

^MQnVlogP^/,.(Xly)//i(X) ^ QnGpVfe'iX\Y)/hiX) 




QGpVfe^{X\Y)/h{X) 
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We first let m — > oo then n — > oo. A uniformity argument then makes the 
result the same when m and n go to infinity simultaneously. The part 
is weak convergence of the empirical process ■y/m(Pm — -P) to a tight Gaus- 
sian process Gp. The part is the law of large numbers. Integration over 
sample paths of Gp gives the distribution of the limit. The asymptotic in- 
dependence between the two terms in (9) comes from the fact that the law 
of large numbers eliminates the randomness coming from Q„. 

2.3. Plug-in estimates for J, V and W . We can construct a confidence 
region for 6* using (4) or (8). If we can evaluate the integrals defining J, 
V and W, then we may use those integrals with 6m,n plugged in for 9* to 
estimate them, assuming enough continuity. Often we cannot evaluate the 
integrals or do not know g. Then we use their sample versions, 

1 " , 

Jm,n = V log {Yj ) , 

(10) V _^viog/^ (>S)Vlog/^ {Yjf, 

i=i 

-j^ m 

m ~ 

1=1 

where 

1 " 

(11) S, = -J2Vfs^^^iX,\Yj)/h{X,). 

Often these cannot be used as shown because feiy) and fe{x\v) are not 
available in closed form. Then we replace fe{y) by fe,m{y) defined in (2) and 
fe{x\y) by fe{x,y)/ fe,m{y)- The resulting variance estimate Jm]n{^ni,n/n + 
^m,n/iT^)Jm\ sandwich estimator. 

2.4. An alternative Monte Carlo scheme. Each term in (3) uses the same 
X's. An alternative is to use each X once, generating a new sample for each 
term in (3). Then the resulting estimate has the same asymptotic variance 
as in (4) or (8) except that W is replaced by W = Qvar/^ V/e*(X|y)//i(X). 
By Jensen's inequality, W > W. Thus using the X's n times makes 9m,n 
more accurate. 

3. Logit-Normal GLMM examples. The Logit-Normal GLMM refers to 
Bernoulli regression with normal random effects. It has a linear predictor of 
the form 



(12) 



T] = XP+Zb, 
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where X and Z are known design matrices, and (3 and h are unknown vec- 
tors (fixed effects and random effects, resp.). Tlie observed data consist of 
n i.i.d. responses, one for each individual, and the missing data consist of 
n i.i.d. random effects vectors, one for each individual with 6~ AA(0, S) (we 
denote the missing data by 6, instead of x, to avoid confusion with X). The 
observed data for one individual is a vector whose components are indepen- 
dent Bernoulli given 6, with success probability vector having components 
\og\i~^{r]k) = 1/(1 + exp(— r/fc)). The unknown parameters to be estimated 
are [3 and the parameters determining the variance S of random effects, 
which typically has simple structure and involves only a few parameters. 
We reparametrized (12) as 

(13) ri = Xl3 + Z/\b, 

where A is a diagonal matrix whose diagonal is a vector of unknown pa- 
rameters (square roots of variance components) and 6 is a standard normal 
random vector (whose distribution contains no unknown parameters) . All of 
the unknown parameters are in (3 and A in the linear predictor (13). This 
representation is flexible enough to include the examples in this article. We 
used the standard normal density (which is the true density of h) as our 
importance sampling density. This makes sense because of our reparameti- 
zation to make the density of b not depend on the parameters. This is not 
a general recommendation of the normal density. 

We wrote an R package bernor that implements the methods of this arti- 
cle for the Logit-Normal GLMM (available at www . stat . umn . edu/ geyer/bernor) . 
The web page also contains detailed verification of the conditions of our 
theorems for the model and detailed descriptions of its applications to our 
examples. 

3.1. Conditions of the theorems. The Logit-Normal GLMM with our 
importance sampling density satisfies the conditions of both theorems. Ver- 
ifying the conditions is straightforward because of two properties. First, the 
sample space y is finite. Thus verifying Glivenko-Cantelli in conditions (4) 
and (5) of Theorem 2.2 and condition (6) of Theorem 2.3 reduces to just 
verifying that functions are L^(P), and verifying Donsker in condition (5) of 
Theorem 2.3 reduces to just verifying that functions are L?'{P). Also verify- 
ing Glivenko-Cantelli in condition (7) of Theorem 2.3 reduces to just verify- 
ing that for each y, the class {V"^ f Q{-\y) / h{-) : 9 G Sp} is P-Glivenko-Cantelli. 
This can be verified like condition (4) of Theorem 2.3 as discussed after the 
theorem. Second, our importance sampling density h is the marginal density 
of the missing data and this implies f0{b,y)/h{b) = fe{y\b), which makes it 
easy to verify that functions are or L^. Differentiability under the integral 
sign twice follows from h having two moments. 




Fig. 1. Monte Carlo profile log likelihood (A) and nominal 95% confidence ellipses (B) 
for the Booth and Hobert data using m = 10'' . Solid dot and solid line are the MCMLE 
and confidence ellipse using plug-in estimates of J, V and W at the MCMLE. Hollow dot 
and dashed line are the MLE and confidence ellipse using Fisher information and exact 
W at the MLE. Square and dotted line are the '^simulation truth" parameter value and 
confidence ellipse using Fisher information and exact W at the simulation truth. The last 
two assume V = J . 



3.2. Data from McCulloch^ s model. We use a data set given by Booth 
and Hobert [5], Table 2, that was simulated using a model from [22]. This 
model corresponds to a Logit-Normal GLMM with one-dimensional (3 and 
h in (12), and its log likelihood can be calculated exactly by numerical in- 
tegration. The observed data consist of ten i.i.d. vectors of length 15. The 
parameters that generated the data are /3 = 5 and a = a/1/2. 

Using a Monte Carlo sample of size 10^, we approximated the observed 
data log likelihood and obtained the MCMLE. The Monte Carlo profile log 
likelihood for a (Figure lA) indicates that the log likelihood is well behaved, 
quadratic around the MLE, and that our MCMLE (/3m, n = 6.15, a"m,n = 1.31) 
is very close to the MLE (/3„ = 6.13, (T„ = 1.33). Using plug-in estimates given 
by (10), we also obtained a nominal 95% confidence ellipse for the true 
parameter (the solid ellipse in Figure IB). For comparison, we obtained two 
other confidence ellipses using the theoretical expected Fisher information 
and W at the MLE (the dashed ellipse in Figure IB) and also at the true 
parameter (the dotted ellipse in Figure IB). Both these exact evaluations 
took 13 hours, whereas our plug-in estimates took two and-a-half minutes. 
Our MCMLE and the MLE are not close to the truth, and these ellipses are 
different, indicating that an observed data sample size n = 10 is too small 
to apply asymptotics. But our MCMLE is close to the MLE, indicating that 
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Fig. 2. Sampling distribution of the MCMLE. Hollow dots are the MCMLE's for 100 
simulated data sets, using sample sizes n = 500 and m = 100. The solid dot is the ^^sim- 
ulation truth" parameter value. The solid curve is the asymptotic 95% coverage ellipse. 
The dashed curve is what the 95% coverage ellipse would be if m were infinity. 



our Monte Carlo sample size m = 10^ is good enough for estimating the 
MLE for the observed data. 

3.3. Simulation for McCulloch^ s model. To demonstrate our asymptotic 
theory, we did a simulation study using the same model with sample sizes 
n = 500 and m = 100. [We chose these sample sizes so that the two terms 
that make up the variance in (4) have roughly the same size.] Figure 2 gives 
the scatter plot of 100 MCMLE's. The solid ellipse is an asymptotic 95% 
coverage ellipse using the theoretical expected Fisher information and W . 
The dashed ellipse is what we would have if we had very large Monte Carlo 
sample size m, leaving n the same. The solid ellipse contains 92 out of 100 
points, thus asymptotics appear to work well at these sample sizes. 

3.4. The influenza data. Table 1 in [7] shows data collected from 263 
individuals about four influenza outbreaks from 1977 to 1981 in Michigan. 
Thus the observed data consist of 263 i.i.d. vectors of length four. Coull and 
Agresti [7] used a Logit-Normal GLMM with four-dimensional (3 and b in 
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Fig. 3. Monte Carlo profile log likelihood using m = 10*. For each a, other parameters 
are maximized. The solid dot is the MLE reported by Coull and Agresti [7] . Leftmost point 
(a = Q) corresponds to the MLE for the model without random effects. 



Pi Pi 
Pi P2 
1 P2 
P2 1 . 

and reported the MLE as (3 = (-4.0, -4.4, -4.7, -4.5), a = 4.05, pi = 0.43 
and p2 = —0.25. Our reparametrization (13) that corresponds to this model 
has four-dimensional identity matrix X, four-dimensional (3, 

/I 1 1 0\ 
11 1 10 
~ 1 1 10' 

Vi -1 1/ 

six-dimensional diagonal matrix A with diagonal elements 51,62,6^,63,63,53, 
and six-dimensional b. 

Using a Monte Carlo sample of size 10^, we approximated the observed 
data log likelihood and found a ridge in the log likelihood surface (Figure 3). 



(12) and b having variance matrix 

1 



a 



Pi 
Pi 
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Pi 
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Pi 

P2 
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Table 1 

Parameter values along the ridge of the Monte Carlo log likelihood for the Influenza 
Data, using Monte Carlo sample size m = 10^ . The MLE from [7] is provided in the last 

row for comparison 



<J 


Pi 


P2 




/3 






MC log likelihood 


1.60 


0.79 


-0.47 


-2.1 


-2.3 


-2.5 


-2.4 


-448.717 


2.00 


0.64 


-0.38 


-2.4 


-2.6 


-2.8 


-2.7 


-448.682 


3.00 


0.48 


-0.28 


-3.2 


-3.5 


-3.7 


-3.6 


-448.646 


4.00 


0.43 


-0.25 


-4.0 


-4.4 


-4.6 


-4.5 


-448.635 


5.00 


0.40 


-0.23 


-4.8 


-5.3 


-5.6 


-5.5 


-448.631 


6.00 


0.39 


-0.22 


-5.7 


-6.2 


-6.6 


-6.4 


-448.629 


4.05 


0.43 


-0.25 


-4.0 


-4.4 


-4.7 


-4.5 


-448.646 



(Monte Carlo sample size 10^ gave results identical to three decimal places.) 
The log likelihood is strongly curved in directions orthogonal to the ridge 
but hardly changes along the ridge. Fisher information is nearly singular 
because of this ridge. Parameter values along the ridge (Table 1) vary over a 
large range, and the bigger a is the more extreme the components of /3 are. 
This is a surprise because sample size 263 is usually large enough for making 
inference about seven parameters. Even though the model is identifiable, it 
is not clear that asymptotics would hold for any sample size. Hence some 
penalized likelihood method should probably be used. 

3.5. The salamander data. We use the data in [21], Section 14.5, that 
were obtained from a salamander mating experiment and have been ana- 
lyzed many times (see [5], for one analysis and citations of others). This 
example has been considered difficult to analyze because its likelihood in- 
volves a 20-dimensional integral. We use "Model A" of Karim and Zeger 
[16], which corresponds to a Logit-Normal GLMM with four-dimensional 
j3 and 20-dimensional b in (12) with two parameters determining the vari- 
ance of b. The observed data consist of three i.i.d. vectors of length 120. 
The MLE given by Booth and Robert [5] is /3 = (1.03, 0.32, -1.95, 0.99) and 
a = (1.18,1.12). Based on Monte Carlo sample size 10^, our MCMLE was 
Pm.n = (1.00,0.53, —1.78, 1.27) and am,n = (1-10, 1.17), and the standard er- 
rors were (0.35, 0.33, 0.36, 0.53) for (3m,n and (0.20, 0.28) for am,n- Our 
method did not work well for these data, and these standard errors give a 
clear indication of the accuracy of our MCMLE. 

4. Discussion. We have described a Monte Carlo method to approximate 
the observed data likelihood and the MLE when there are missing data 
and the observed data likelihood is not available in closed form. The MLE 
converges to the minimizer 6* of the Kullback-Leibler information, which 
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is the true parameter value when the model is correctly specified. We have 
proved that our MCMLE is a consistent and asymptotically normal estimate 
of 9* as both Monte Carlo and observed data sample sizes go to infinity 
simultaneously. Plug-in estimates of the asymptotic variance are provided 
in (10) for constructing confidence regions for 0* . 

We have presented the theory so that it can be used for studying model 
misspecification in missing data models. In practice, a statistical model fe is 
often chosen only for mathematical convenience and may contain simplistic 
and unrealistic assumptions. However, it is usually possible to simulate i.i.d. 
data y's from a more realistic model g. The theory applies whether the Y's 
are a Monte Carlo sample or real data. In either case we can estimate 6* using 
6m,n and know what accuracy we have. By comparing (an estimate of 

fg* , the "best" approximation to g in the model) with g, we can assess model 
validity as whether the particular model is reasonable for approximating the 
truth or how its simplifying assumptions influence scientific conclusions. 

Our applications to the Logit-Normal GLMM examples illustrate advan- 
tages and disadvantages of our method. First, our method uses ordinary 
(independent sample) Monte Carlo, thus is simpler to implement and easier 
to understand than MCMC. Second, it always provides accurate standard er- 
rors, and they give a clear indication of when the method works and how well. 
MCMC methods require more careful tuning and do not provide analogous 
standard errors. MCMC diagnostics are widely used but give no guarantees, 
and convergence proofs are very difficult except for simple applications and 
are not widely used. Third, our method approximates the likelihood over 
the entire parameter space. We have seen the advantage of such likelihood 
evaluation for the influenza data in Section 3.4. One can assess whether the 
likelihood is well behaved so that appropriate inference can be based on the 
MLE. The only disadvantage of our method arises from its simple Monte 
Carlo scheme. It does not work well with high-dimensional missing data as 
in the salamander data in Section 3.5. 

Our method is based on sampling from an importance sampling density 
h. In theory, we want the optimal h that makes W as small as possible so 
that the MCMLE is as accurate as possible. The form of W in (7) says 
that we want h{x) to be high where QV fe*{x\Y) is high. In very simple 
situations we can find such h (Sung [27] finds the optimal h for a normal 
mixture model). In complicated situations, just as in ordinary importance 
sampling, one cannot calculate the optimal h and must proceed by trial and 
error. 

Asymptotic theory analogous to ours does not exist for MCMC. It in- 
volves three quantities: the MCMLE O^.n is a function of both simulated 
missing data and observed data, the MLE On (which cannot be calculated 
exactly) is a function of observed data only, and 6* is the true parame- 
ter value. Geyer [12] provides asymptotic theory for 6m,n — Gn conditional 
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on observed data, accounting for only Monte Carlo variability, not sampling 
variability. Classical theory of maximum likelihood provides asymptotic the- 
ory for 9n — 9*, accounting for sampling variability. As we have seen in this 
article, it is not easy to combine these two sources of variability, and this 
has not been tried for MCMC. Our method could be extended so that the 
importance sampling density can depend on observed data, which is usu- 
ally done in MCMC. We suppose the theory for that would be considerably 
more complicated than what we have presented here and would be even 
more complicated for MCMC. 

Even though our original motivation was theoretical, our method does 
work in practical examples. The bernor package can be used for analysis 
of Logit-Normal GLMM. Our method is applicable to other missing data 
models. 

APPENDIX 

A.l. Proof of Theorem 2.2. Let (mfc,?ifc) be a subsequence. We need to 
show 

K < e-liminffc Km^^n,,, < e-limsup^ Km^^^n^ < K, 
which is equivalent to 

(14) K{0)< sup liminfinfi^^,, 

BeAf{9) 

(15) K{e)> sup llTaSVL^Vcd Krn^^^rikif), 

BeAfie) k-^oo <PG-B 

where M{6) is the set of neighborhoods of 6. By condition (1) there is a 
countable basis B = -B2, • • •} for the topology of O. Choose a count- 
able dense subset 0c = {^i, ^2, • • •} by choosing 0„ G Bn to satisfy K{6n) < 
inf06B„ K{(t)) + l/n. Let Mc{0) = {B e BnM{e) :BcBenCe} where Bg is 
given by condition (3) and Cg is given by condition (4). Suprema over M{6) 
in (14) and (15) can be replaced by suprema over the countable set Mc{6)- 
By Lemma A.l below 

(16) limsupi^„,„„,(0)<ir(0) 

for each 9 with probability one, and by Lemma A. 2 below 

ux,y) 



(17) liminfinfK™, ,„,((/.) > -Q log P sup , 

for each B G Nc{0) with probability one. Since Oc and ljgg0 A/'c(0) are count- 
able and since a countable union of null sets is a null set, we have (16) and 
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(17) simultaneously on @c and Ueee-^c(^) with probability one. U B £ B 
and 9 £Br\ @c, then by (16) 

K{9) > limsupKmfe,nfc(6') > limsup inf Kmk,nk{4')- 

k k-^oo <t>^B 

Hence 

sup inf K{(j))> sup limsup inf Kmj. , rife (</>)• 

The term on the left-hand side is K[9) by lower semicontinuity of K (Lemma A. 3 
below) and by the construction of Qc- This proves (15). We also have 

sup liminf inf i^mfe,nfc(0) > sup — QlogPsup ' ^ 



= -Q\ogP inf sup ^f^^'^\ 

where the inequality follows from (17), the first equality from the mono- 
tone convergence theorem and the second equality from condition (2). This 
proves (14). 

Lemma A.l. Under condition (5) of Theorem 2.2, Km,n{G) K{9). 
Proof. Since /e,,n(y)//e(y) - 1 = {F^- P)fe{-\y)/h{-) by condition (5), 

\\fe,mi■)/fe{■)-M\y^^^O 
by Lemma 1.9.2 in [32]. This implies 

(18) sup \Km,n{G) - 

nGN 

since Km,n{G)-Kn{e) = i Ei=i log[fe{Yj)/fg,m{Yj)]. Since Kn{9) K{9) 
by the strong law of large numbers, the result follows by the triangle inequal- 
ity. □ 

Lemma A. 2. Under conditions (3) and (4) of Theorem 2.2, 

(19) hminf inf K^,„.(,^) > -QlogPsup [f^'J^ 

(m,n)^(oo,oo)<^GB ^(zB h{X)g{Y) 

with probability one for each subset B of BgOCg. 
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Proof. By condition (3) the term on the right-hand side in (19) is not 
— oo. Next 

inf (cj)) = - sup Qn log ¥m ffl^'ll > -Qn log SUp "^^^^'^^ 



■f>^B h{X)g{Y) ^^Bh{X)g{Y) 

By condition (4), for any ei > and £2 > 0, there are measurable A and M G 
N such that Pr(A) > 1-ei and sup^g^ ^(•, < Psup^g^ /^(•, y)/ 

h{-)g{y) + £2 for all m > M and y £y uniformly on A. Hence 

M K,,,n{^) > -Q„log|Psup-^^ +£2 j 

for all m > M and y £y uniformly on A. By the strong law of large numbers 
on the right-hand side, there are measurable B and N £N such that Pr(iJ) > 
1 — 63 and 

for all n > uniformly on B. Hence 



inf > -Qlog<^ Psup 



+ 62} -84, 



for all m> M and n> N uniformly on AoB. We are done since the e's 
were arbitrary. □ 

Lemma A. 3. Under conditions (2) and (3) of Theorem 2.2, K is lower 
semicontinuous. 

Proof. Let be a point of and {^fc} a sequence in converging to 
e. Then 

hmsupQlog— — < hm Q log P sup 

fc^oo A.->n /l WS'l^j 

= QlogPlimsup " , 

where the equality follows from the monotone convergence theorem by con- 
dition (3). Also, 

liminfi^(efc) > -QlogPhmsup^^VT^ ^ -QlogP -^'[^'^j = K{9), 
fc^oo h{^)g{y) h{X)g{Y) 

where the last inequality follows from condition (2). □ 



MONTE CARLO LIKELIHOOD ASYMPTOTICS 



17 



A.2. Proof of Theorem 2.3. If we define 

(20) D„,„= (\^Km.n{0* + S{em.,n- e*))ds, 

Jo 

then by Taylor series expansion 

^ J^m,n{(^m,n) ^ ■^m,n(,G ) — Din,ni,^m,n ^ )• 

If we show 

(21) - + — VKmAn^^J), 
\n m J 

P 

then since Dm,n — > J by Lemma A. 6 below, eventuahy n '^i^^ exist, and 
by Slutsky's theorem 

fV W\'^/'^ - (V iy\"^/^ 1 

- + — J(em,n-r) = - - + — jz)-i„vir^,n(r) + op(i) 

\n m / \n m J 

-^MiO,I). 

If we prove (21) under the condition n/{m + n) — > a, the subsequence prin- 
ciple gives us (21) without this condition. If < a < 1, then (m + n)(y^/n + 
W/m) ^V/a + W/{l-a). Since 

VKra,nm = -QnVlogfe*{Y)-qnVlogF„Je*{X\Y)/h{X), 

we have y/m + nS/ Km,n{0*) -^M{Q,V/a + W^/(l - a)) by Lemma A.4 be- 
low, and in turn we have (21) by Slutsky's theorem. The a = and a = 1 
cases are similar. 



Lemma A. 4. Under conditions (1) through (3), (5) and (6) of Theo- 
rem 2.3, 

f V^Qn V log fe*{Y) \C rf fV 

^^^^ [V^Qn\/logFmfe*{X\Y)/h{X)J '^"'1,0 W 

Proof. By condition (5), P in 1°°{J^2) and by condition (6), 

Gm — > Gp in /°°(.7^3), where Gm = \A^(lPm. — P) and Gp is a tight Gaussian 
process in 1°°{J^3) with zero mean and covariance function E{Gpf ■ Gpg) = 

Pfg- PfPg. By Slutsky's theorem ([32], Example 1.4.7), (Pm,G„) ^ 

(P,Gp) inB) = /~(.F2) xZ°°(J3). 

By the almost sure representation theorem ([32], Theorem 1.10.4 and Ad- 
dendum 1.10.5), if {Q,A,Pr) is the probability space where Pj^ are defined 
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(Pr can be P°°), there are measurable perfect functions (pm on some prob- 
ability space (il,^,Pr) such that the following diagram commutes 



{IPTji,Gm) 

n — ^ D 




and Pr = Pr oj/)-i^and (P™,G^) ^ (Poo, Goo) in B, where (Poo,%) = 
(P,Gp) and (P 005^00) — (-PiGp). Hence for almost all lD, sup,ygy |(Pm — 
P)i^)fe4-\y)/H-)\^0 and snpy^y\{Gm - Gp)icd)V fe*{-\y)/h{-)\ ^ 0. By 
the uniform continuity of {s,t) 1— > t/s on [so,oo) x M with sq > 



(23) sup 



^^)V/0.(-|?/)//i(- 



0. 



Pm(a;)/e-(-|y)//i(-) 
If we define 

(24) kiu;,y)=Gp{u;)Vfe*i-\y)/h{-), 

and show y ^ k{uj,y) is bounded and continuous for almost all uj, then the 
second term on the left-hand side in (23) [which equals k{uj, •) = k{(f)^{oj), •)] 
is bounded and continuous for almost all Co. By Lemma A. 5 below, 
Qn{v)^p{Co)Vfe*{X\Y)/h{X)^QGp{Cj)Vfe'{X\Y)/h{X) for almost all r? 
and u), and this with (23) leads to 



VrnQn(r?)VlogP„(cD)V/,*(X|y)//i(X) ^ QGp(cu)V/9.(X|y)//i(X) 

(25) 

for almost all -q and Co. Even though first m — > 00 and then n — > cx), the limit 
can be shown to be the same (by a triangle inequality) no matter how m 
and n go to infinity because of the uniformity in (23). 

The function y 1— > k{uj,y) is bounded since supy^y \k{LO , y)\ = 
||Gp(a;)||jF3 < 00 from Gp{lo) G Every subscript i refers to the ith 

coordinate in W^. For almost all uj, the sample path / Gp{uj)f is p- 
continuous on J^^i ([32], Section 1.5), where p{f,g) = {P{f — f?)^}^^^- The 
function y 1— > [V/6)*(-|y)//i(-)]i from 3^ to {Tzi,p) is continuous by condi- 
tion (2) and the dominated convergence theorem applied to p{\S^ fe*{'\yn) / 
h{-)\iA^ fe*{-\y)/K-)]if <^P{Fi) <oo with yn^y and F in condition (6). 
The function y 1-^ ki{io, y) is a composition of the two continuous functions, 
hence continuous, for almost all oj. 

By the central limit theorem V^Q^V log fe* {Y) AA(0, V) , and if (iJ, 
Qr) is the probability space where are defined (Qr can be (5°°), there is 
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an almost sure representation for this with commutative diagram 




and Qr = Qr o ip^^ . If we combine this representation with (25), 

^Qn{ri)V log fg*{Y) \ 
mQ„(r/)VlogP„(cu)V/e*(X|y)//i(X)y' 



ziv) 

^QGp{^)Vfe*{X\Y)/h{X)^ 

for almost all i) and tD, where Z{fi) is M{Q,V). In this representation, it is 
clear that the two terms on the right-hand side, being functions of inde- 
pendent random variables, are independent. This almost sure convergence 
implies weak convergence, and undoing the almost sure representation gives 

V^Q„Vlog/r(n \ cf Z \ 



mQ„VlogP™,V/e.(X|y)//i(X)J \QGpV fe*{X\Y) /h{X) ) 



We are done if we show that the second term on the right-hand side is 
M{Q,W). 

Let r = QGpV/e*(X|y)//i(X). Note T {oj) = Qk{uj , ■) with k in (24). By 
condition (2) there is a sequence {Qi} of probability measures with finite 

support such that Qi — >Q ([1], Theorem 14.10 and Theorem 14.12). Let 
Ti{uj) = Qik{uj, •). Then Ti{uj) — > T{uj) for almost all lo because y i— > k{uj,y) 
is bounded and continuous for almost all w. Since Gp is a Gaussian pro- 
cess, Ti is normally distributed. By condition (6), {y, s) ^ E[k{-,y)k{-, s)'^] 
is bounded and continuous by the dominated convergence theorem. Hence 
varTj — > varT, and by Pubini the limit equals W. Now for any t G M'^ 

exp(-t^(varTi)i/2) ^exp(-t'^m/2). Hence T, -^J\f{0,W) and T ~ AA(0, T^). 
□ 

Lemma A. 5. Under condition (2) of Theorem 2.3, Qn — > Q almost 
surely. 

Proof. Let ;S be a countable basis for 3^ and A be the set of all finite 
intersections of elements of B (also countable). For each A G A we have 
Qn(^) Q{^) by the strong law of large numbers. Hence, a countable union 
of null sets being a null set, this holds simultaneously for all A € A. The 
result follows since ^ is a convergence determining class ([4], Theorem 2.2). 
□ 
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p 

Lemma A. 6. Under conditions (4) through (8) of Theorem 2.3, D^.n — ^ 
J, where Dm,n is defined by (20) and J in condition (3) of Theorem 2.3. 

Proof. First note 

\D^,n -J\< j\v^Km,n{e* + s{§^,n " 0*)) + QV^ log ^ ) | ds 

+ ^sup J Qv^ log /,.^,(,-^ (y ) - Qv^ log (y) I . 

By condition (4), QV'^ log fg{Y) is continuous on Sp ([9], page 110). Hence 
the second term on the right-hand side converges in probability to zero by 
the weak consistency of 9m,n- The first term on the right-hand side will also 
converge in probability to zero because for any e > 

Pr^ (\v^Kra,n{e* + s{e,n,n " 9*)) + QV^ log /,._^^(,~ _e*^{Y)\ ds > 8 

(26) 



< Pr(^^,„, ^ Sp) + Pr sup iV'Km.niO) + QV' log fe(Y)\ >e], 
\eeSp ' J 

if we show the second term on the right-hand goes to zero. Note V'^ Km,n{G) = 
-Q„V2 log fe{Y) - QnWmiO, Y) where 

^ ¥^V^fe{-\y)/h{-) _ {FmVfe{-\y)/h{-)}{Frr,Vfe{-\y)/h{.)}'^ 

'"^ '^^ FM\y)/H-) {^mfe{-\y)/h{-)r 

By condition (4), snpg^s, IQnV^ log/e(y) - QV^ log/,(y)| ^ 0. Hence the 
second term on the right-hand side in (26) will go to zero, if we show 

SUpg^sJQnWm{0,Y)\^-^0. 

By condition (7) 
(27) supsup|F^v2/e(-|y)A(-)l ^0. 

Expanding P.mV/e(-|y)//i(-) as 

^mVfe{-\y)/h{-)=FmVfe*{-\y)/H-) 

+ 1^ ¥m^^fe*+sie^e*)i-\y)/H-){e - 6*) ds 

leads to, for any 9 £ Sp, 

sup \¥mVfe{-\y)/h{-)\ < sup |P^V/e. (-jy) //i(-)l 
yey y&y 



+ sup sup|P™,V2/e(-|y)//i(-)|P- 

eeSp y&y 
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The first term on the right-hand side converges almost surely to zero because 
^3 is P-Glivenko-Cantelli from being P-Donsker [condition (6)]. Since the 
second term on the right-hand side also converges almost surely to zero by 

(27) , 

(28) supsup|P^V/e(-|y)//i(-)| ^0. 

With condition (5) and (28), supgg_5^ sup^^g^y |Pm/e('|y)/^(') — 1| ^^^0, and 

this, (27) and (28) imply supgg_5^ sup^^^-y | Wm(^, 2/)| ^^^^0. We are done be- 
cause SUY>e(iSp\'^nWm{0,Y)\ <SUp0g5^SUPygy|VFm(6l,J/)|. □ 
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